Every HTML5 document contains the following DTD-less DOCTYPE declaration (see attachment):
<!DOCTYPE html>
This makes the validating XML parser report a lot of errors about undeclared elements. The
XML spec says that a document with a DTD-less DOCTYPE declaration can still be well-formed.
Since this is a very common case, it should be handled more gracefully. If there is
no (internal or external) DTD, it would be more useful to stop validating and only
report well-formedness errors, so that these do not get lost among the validity errors.
See also http://www.w3.org/TR/html51/syntax.html#the-doctype
and http://www.w3.org/TR/REC-xml/#NT-doctypedecl
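
To reproduce the behaviour outside the editor, here is a minimal sketch using the JAXP SAX API that ships with the JDK. It is not the plugin's code; the class name `DoctypeDemo` and the helper `validityErrors` are made up for illustration. It parses the same document twice, once with DTD validation on and once with it off, and collects the recoverable errors the parser reports.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of jEdit or Xerces.
class DoctypeDemo {

    /**
     * Parses the given document and returns the messages of all recoverable
     * errors (this is where validity errors are reported). Well-formedness
     * violations are fatal and surface as an exception instead.
     */
    static List<String> validityErrors(String xml, boolean validating) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(validating); // DTD validation on or off
        SAXParser parser = factory.newSAXParser();

        List<String> errors = new ArrayList<>();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                errors.add(e.getMessage());
            }
        });
        return errors;
    }

    public static void main(String[] args) throws Exception {
        String doc = "<!DOCTYPE html>\n"
                   + "<html><head><title>t</title></head><body/></html>";
        System.out.println("validating:     " + validityErrors(doc, true).size() + " error(s)");
        System.out.println("non-validating: " + validityErrors(doc, false).size() + " error(s)");
    }
}
```

With validation on, the document is still parsed to the end, but every element is flagged; with validation off, only genuine well-formedness problems would be reported.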
Submitted | will69 - 2015-12-22 09:03:20.526000 | Assigned | kerik-sf |
---|---|---|---|
Priority | 5 | Labels | |
Status | pending | Group | |
Resolution | fixed |
2015-12-22 09:11:15.312000 will69 |
|
---|---|
2016-01-31 15:21:18.153000 kerik-sf |
Long story short: it doesn't seem possible to disable DTD validation on the fly without
seriously hacking Xerces-J. So I have to interrupt parsing and reparse when detecting
the empty HTML doctype.
|
2016-02-09 11:43:20.006000 will69 |
Hi Eric. Thanks for following up on this! There is an HTML and an XHTML syntax for
HTML5. The XHTML variant is, of course, an application of XML. UTF-8/UTF-16 is the
default encoding for any XML application. [See here](https://wiki.whatwg.org/wiki/HTML_vs._XHTML)
for a comparison of HTML5 and XHTML5. So is this actually a problem with Xerces that
should be filed upstream? What about reading the first 15 characters and switching
validation off before invoking the parser? jEdit reads the first line of a file to
determine the file type anyway, doesn't it?
|
2016-02-09 21:45:08.404000 kerik-sf |
Hi will69,
|
2016-02-10 11:09:04.618000 will69 |
This works great! It even starts validating again as soon as I use an internal subset.
|
2016-02-10 18:03:51.768000 kerik-sf |
Good to know that it works for you.
|
2016-02-12 21:52:34.101000 kerik-sf |
- **status**: open --> pending-fixed |
2016-02-12 21:52:34.563000 kerik-sf |
Will be in the next release. |
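
For reference, the pre-scan idea raised in the thread (look at the start of the file and switch validation off before invoking the parser) could be sketched roughly as follows. This is not the fix that went into the plugin, which interrupts parsing and reparses; it only illustrates the alternative. The class `DoctypeSniffer` and its methods are hypothetical, and the sketch assumes the caller passes a prefix long enough to contain the DOCTYPE, since an XML declaration or comments may precede it.

```java
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper, not part of jEdit or Xerces.
class DoctypeSniffer {

    // A DOCTYPE that consists of a root element name only:
    // no PUBLIC/SYSTEM identifier and no [internal subset].
    private static final Pattern BARE_DOCTYPE =
            Pattern.compile("<!DOCTYPE\\s+[^\\s\\[>]+\\s*>");

    /** Returns true if the given start of a document has a DTD-less DOCTYPE. */
    static boolean hasDtdLessDoctype(String head) {
        return BARE_DOCTYPE.matcher(head).find();
    }

    /** Creates a factory that validates only when there is a DTD to validate against. */
    static SAXParserFactory factoryFor(String head) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(!hasDtdLessDoctype(head));
        return factory;
    }
}
```

A DOCTYPE with an internal subset or an external identifier does not match the pattern, so validation stays on in those cases. The regular expression is deliberately naive: it would also match a bare DOCTYPE that appears inside a comment.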